arXiv:1506.01520v3 [stat.ML] 15 Dec 2015 


An Average Classification Algorithm 


Brendan van Rooyen 3 (previously 1,2)

Aditya Krishna Menon 2,1 Robert C. Williamson 1,2 

1 The Australian National University 2 National ICT Australia 
3 The Queensland University of Technology


"While the truth is rarely pure, it can be simple."

- Oscar Wilde via Robert C. Williamson, The Importance of Being Unhinged 


In the problem of binary classification, the goal is to learn a classifier
that accurately predicts an instance's corresponding label. Many cutting
edge classification algorithms, such as the support vector machine, logistic
regression, boosting and so on, output a classifier of the form,

\[ f(x) = \mathrm{sign}\Big( \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) \Big), \tag{1} \]

with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$ and $K(x,x')$ a function that measures the similarity
of two instances $x$ and $x'$. Algorithmically, optimizing these weights is
a difficult problem that still attracts much research effort. Furthermore,
explaining these methods to the uninitiated is a difficult task. Letting all
the $\alpha_i$ be equal in (1) leads to a conceptually simpler classification rule, one
that requires little effort to motivate or explain: the mean,

\[ f(x) = \mathrm{sign}\Big( \frac{1}{n} \sum_{i=1}^{n} y_i K(x_i, x) \Big). \]

The above is a simple and intuitive classification rule. It classifies by the
average similarity to the previously observed positive and negative instances,
with the most similar class being the output of the classifier. It
has been studied previously, for example in chapter one of [35] and further
in [16, 36, 28, 5]. The main drawbacks of the mean classification rule
are prohibitive storage and evaluation costs. In fact, this is the motivation
given for the support vector machine (SVM) in [35]. Our goal here is to
reinvigorate interest in this very average algorithm.

The chapter proceeds as follows: 

• We argue for the mean classifier, showing it is the ERM solution for
a classification calibrated loss function (Theorems 1 and 2).

• We explore the robustness properties of the mean classifier. We relate
the noise tolerance of the mean algorithm to the margin for error in
the solution (Theorem 6). We then show that, in a certain sense, the
mean classifier is the only surrogate loss minimization method that
is immune to the effects of symmetric label noise (Theorem 14).

• Finally, we show how to sparsely approximate any kernel classifier
through the use of kernel herding [44, 9, 3]. This produces a simple,
understandable means of choosing representative points, with
provable rates of convergence.

The result is a conceptually simple algorithm for learning classifiers, that 
is accurate, easily parallelized, robust and firmly grounded in theory. 

1 Basic Notation 

Denote by $\mathcal{H}$ an abstract Hilbert space, with inner product $\langle v_1, v_2 \rangle$. For
any $v \in \mathcal{H}$, denote by $\hat{v}$ the unit vector in the direction of $v$. We work with
loss functions $\ell : \{-1,1\} \times \mathbb{R} \to \mathbb{R}$.

2 Kernel Classifiers 

Let $X$ be the instance space and $Y = \{-1,1\}$ the label space. A classifier is
a bounded function $f \in \mathbb{R}^X$, with $f(x)$ the score assigned to the instance
$x$ and $\mathrm{sign}(f(x))$ the predicted label. For a distribution $P \in \mathcal{P}(X \times Y)$,
we define the Bayes optimal classifier to be the classifier $f_P(x) = 1$ if
$P(Y = 1 \mid X = x) > \frac{1}{2}$ and $-1$ otherwise. We measure the distance between
classifiers via the supremum distance,

\[ \|f - f'\|_\infty = \sup_{x \in X} |f(x) - f'(x)|. \]

A classification algorithm is a function $A : \bigcup_{n=1}^{\infty} (X \times Y)^n \to \mathbb{R}^X$ that,
given a training set $S$, outputs a classifier. Define the misclassification loss
$\ell_{01}(y,v) = [\![ yv \le 0 ]\!]$. Note that $\ell_{01}(y,0) = 1$ always. An output of zero can
be viewed as abstaining from choosing a label. For any loss $\ell : Y \times \mathbb{R} \to \mathbb{R}$,
we remind the reader of the risk and sample risk of $f$, defined as,

\[ \ell(P,f) := \mathbb{E}_{(x,y)\sim P}\, \ell(y,f(x)) \quad \text{and} \quad \ell(S,f) := \frac{1}{|S|} \sum_{(x,y) \in S} \ell(y,f(x)), \]

respectively. Good classification algorithms should output classifiers with
low misclassification risk. Many classification algorithms, such as the
SVM, logistic regression, boosting and so on, output a classifier of the
form,

\[ f(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i), \]

with $\alpha_i \ge 0$, $\sum_i \alpha_i = 1$ and $K(x,x') = \langle \phi(x), \phi(x') \rangle$ a kernel function, an
inner product of feature vectors in a (possibly infinite dimensional) feature
space. For simplicity, we use the mean,

\[ f(x) = \frac{1}{n} \sum_{i=1}^{n} y_i K(x_i, x). \tag{2} \]
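The rule in equation (2) can be sketched in a few lines. The following is a minimal illustration, assuming a Gaussian kernel; the helper names `gaussian_kernel` and `mean_classifier_scores`, the bandwidth and the toy data are choices made for this example and do not come from the text.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Kernel matrix with entries exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mean_classifier_scores(X_train, y_train, X_test, kernel=gaussian_kernel):
    """Scores f(x) = (1/n) * sum_i y_i K(x_i, x) for each row x of X_test."""
    K = kernel(X_test, X_train)              # shape (n_test, n_train)
    return K @ y_train / len(y_train)

# Toy data (illustrative only): two separated clusters on the real line.
X_train = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_train = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
X_test = np.array([[-1.2], [1.7]])

scores = mean_classifier_scores(X_train, y_train, X_test)
predictions = np.sign(scores)                # predicted labels
```

Every prediction touches the whole training set, which is the storage and evaluation cost that the herding section sets out to reduce.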

3 Why the Mean? 

The mean is not only an intuitively appealing classification rule, it also
arises as the optimal classifier for the linear loss, considered previously in
[31, 32]. Let,

\[ \ell_{\mathrm{linear}}(y,v) = 1 - yv. \]

If $v \in \{-1,1\}$, then $\ell_{01}(y,v) = \frac{1}{2}\ell_{\mathrm{linear}}(y,v)$. Allowing $v \in [-1,1]$ provides
a convexification of misclassification loss. For $v \in [-1,1]$, $\ell_{01}(y,v) \le
\ell_{\mathrm{linear}}(y,v)$. Furthermore, we have the following surrogate regret bound.

Theorem 1 (Surrogate Regret Bound for Linear Loss). For all distributions
$P$,

\[ f_P = \operatorname*{argmin}_{f \in [-1,1]^X} \ell_{\mathrm{linear}}(P,f) \in \operatorname*{argmin}_{f} \ell_{01}(P,f). \]

Furthermore for all $f \in [-1,1]^X$,

\[ \ell_{01}(P,f) - \ell_{01}(P,f_P) \le \ell_{\mathrm{linear}}(P,f) - \ell_{\mathrm{linear}}(P,f_P). \]

By Theorem 1, linear loss is a suitable surrogate loss for learning classifiers,
much like the hinge, logistic and exponential loss functions [34]. As
is usual, rather than minimizing over all bounded functions, to avoid overfitting
the sample we work with a restricted function class. For a feature
map $\phi : X \to \mathcal{H}$, define the linear function class,

\[ \mathcal{F}_\phi := \{ f_\omega(x) = \langle \omega, \phi(x) \rangle : \omega \in \mathcal{H} \}, \]

and the bounded linear function class,

\[ \mathcal{F}_\phi^r := \{ f_\omega(x) = \langle \omega, \phi(x) \rangle : \omega \in \mathcal{H},\ \|\omega\| \le r \}. \]

We will assume throughout that the feature map is bounded, $\|\phi(x)\| \le 1$ for
all $x$. As shorthand we write $\ell(P,\omega) := \ell(P, f_\omega)$. By the Cauchy-Schwarz
inequality, $\mathcal{F}_\phi^r \subseteq [-r,r]^X$. As a surrogate to minimizing $\ell_{01}(P,f)$ over all
functions, we will minimize $\ell_{\mathrm{linear}}(S,f)$ over $f \in \mathcal{F}_\phi^1$.

For any sample $S \in \bigcup_{n=1}^{\infty} (X \times Y)^n$ define the mean vector

\[ \omega_S = \frac{1}{|S|} \sum_{(x,y) \in S} y\,\phi(x). \]

The mean classifier (2) can be written as $f(x) = \langle \omega_S, \phi(x) \rangle$.

Theorem 2 (The Mean Classifier Minimizes Linear Loss).

\[ \hat{\omega}_S = \operatorname*{argmin}_{\omega : \|\omega\| \le 1} \frac{1}{|S|} \sum_{(x,y) \in S} 1 - y \langle \omega, \phi(x) \rangle = \operatorname*{argmin}_{\omega : \|\omega\| \le 1} 1 - \langle \omega, \omega_S \rangle, \]

with minimum linear loss given by $1 - \|\omega_S\|$. Furthermore classifying using
$\langle \hat{\omega}_S, \phi(x) \rangle$ is equivalent to classifying according to equation (2).

This has been noted in [39]; we include it for completeness. The proof
is a straightforward application of the Cauchy-Schwarz inequality. As
$\hat{\omega}_S = \lambda \omega_S$ with $\lambda > 0$, they both produce the same classifier. Changing the
norm constraint to $\|\omega\| \le r$ merely scales the classifier, and therefore does
not change its misclassification performance. The quantity,

\[ \|\omega_S\|^2 = \frac{1}{|S|^2} \sum_{(x,y) \in S} \sum_{(x',y') \in S} y y' K(x,x'), \]

can be thought of as the "self-similarity" of the sample. For a distribution
$P$, define $\omega_P = \mathbb{E}_{(x,y)\sim P}\, y\,\phi(x)$. It is easily verified that,

\[ \hat{\omega}_P = \operatorname*{argmin}_{\omega : \|\omega\| \le 1} \mathbb{E}_{(x,y)\sim P}\, 1 - y \langle \omega, \phi(x) \rangle = \operatorname*{argmin}_{\omega : \|\omega\| \le 1} 1 - \langle \omega, \omega_P \rangle. \]

Furthermore, we have the following generalization bound on the linear
loss performance of $\hat{\omega}_S$.
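Theorem 2 can be checked numerically on a small sample. The sketch below assumes an explicit, finite dimensional feature map $\phi(x) = x / \|x\|$ (a choice made only for this illustration), so that the mean vector can be formed directly; the synthetic data and variable names are likewise hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
Phi = X / np.linalg.norm(X, axis=1, keepdims=True)   # phi(x), with ||phi(x)|| = 1
y = np.sign(Phi[:, 0] + 0.3 * rng.normal(size=n))    # noisy synthetic labels
y[y == 0] = 1.0

# Mean vector omega_S = (1/|S|) sum y phi(x).
omega_S = (y[:, None] * Phi).mean(axis=0)

# "Self-similarity" computed from the kernel matrix only.
K = Phi @ Phi.T
self_similarity = (np.outer(y, y) * K).sum() / n ** 2

def linear_loss(omega):
    """Sample linear loss (1/|S|) sum 1 - y <omega, phi(x)>."""
    return np.mean(1.0 - y * (Phi @ omega))

# The unit vector in the direction of omega_S attains loss 1 - ||omega_S||.
omega_hat = omega_S / np.linalg.norm(omega_S)
best = linear_loss(omega_hat)

# No other unit vector should do better.
random_dirs = rng.normal(size=(1000, d))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_losses = np.array([linear_loss(w) for w in random_dirs])
```

Note that `self_similarity` uses only kernel evaluations, so $\|\omega_S\|$ remains computable when the feature space is infinite dimensional.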

Theorem 3. For all distributions $P$ and for all bounded feature maps $\phi : X \to \mathcal{H}$,
with probability at least $1 - \delta$ on a draw $S \sim P^n$,

This theorem is a special case of a more refined result, proved in the
appendix to this chapter. In Smola et al. [38], bounds for the error in
estimating the mean are presented.


Theorem 4. For all distributions $P$ and for all bounded feature maps $\phi : X \to \mathcal{H}$,
with probability at least $1 - \delta$ on a draw $S \sim P^n$,


COp — COg 




The proof is via an appeal to standard Rademacher bounds. 


3.1 Relation to Maximum Mean Discrepancy 

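Let $P_\pm \in \mathcal{P}(X)$ be the conditional distribution over instances given a positive
or negative label respectively. Define the maximum mean discrepancy,

\[ \mathrm{MMD}_\phi(P_+,P_-) := \max_{\omega : \|\omega\| \le 1} \frac{1}{2} \left| \mathbb{E}_{x \sim P_+} \langle \omega, \phi(x) \rangle - \mathbb{E}_{x \sim P_-} \langle \omega, \phi(x) \rangle \right| = \frac{1}{2} \|\omega_{P_+} - \omega_{P_-}\|. \]

$\mathrm{MMD}_\phi(P_+,P_-)$ can be seen as a restricted variational divergence,

\[ V(P_+,P_-) = \max_{f \in [-1,1]^X} \frac{1}{2} \left| \mathbb{E}_{x \sim P_+} f(x) - \mathbb{E}_{x \sim P_-} f(x) \right|, \]

a commonly used metric on probability distributions, with the maximum
restricted from all $f \in [-1,1]^X$ to the bounded linear functions
$\langle \omega, \phi(\cdot) \rangle$, $\|\omega\| \le 1$. Define the distribution $P \in \mathcal{P}(X \times Y)$ that first samples $y$ uniformly
from $\{-1,1\}$ and then samples $x \sim P_y$. Then,

\[ \mathrm{MMD}_\phi(P_+,P_-) = \max_{\omega : \|\omega\| \le 1} \left| \mathbb{E}_{(x,y) \sim P} \langle \omega, y\,\phi(x) \rangle \right| = \|\omega_P\|. \]

Therefore, if we assume that positive and negative classes are equally
likely, the mean classifier classifies using the $\omega$ that "witnesses" the MMD,
i.e. it attains the max in the above.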


3.2 Relation to the SVM 

For a regularization parameter $\lambda$, the SVM solves the following convex
objective,

\[ \operatorname*{argmin}_{\omega \in \mathcal{H}} \frac{1}{|S|} \sum_{(x,y) \in S} [1 - y \langle \omega, \phi(x) \rangle]_+ + \frac{\lambda}{2} \|\omega\|^2, \]

where $[x]_+ = \max(x,0)$. This is the Lagrange multiplier problem associated
with,

\[ \operatorname*{argmin}_{\omega : \|\omega\|^2 \le c} \frac{1}{|S|} \sum_{(x,y) \in S} [1 - y \langle \omega, \phi(x) \rangle]_+. \]

If we take $c = 1$, by Cauchy-Schwarz $[1 - y \langle \omega, \phi(x) \rangle]_+ = 1 - y \langle \omega, \phi(x) \rangle$
and the above objective is equivalent to that in Theorem 2. The mean
classifier is the optimal solution to a highly regularized SVM, and is therefore
preferentially optimizing the margin over the sample hinge loss. Prior
evidence exists showing that feature normalisation (which is high regularization
in disguise) increases the generalisation performance of SVMs.


3.3 Relation to Kernel Density Estimation 

On the surface the mean classifier is a discriminative approach. Restricting
to positive kernels, such as the Gaussian kernel, it can be seen as the following
generative approach: estimate $P$ with $\hat{P}$, with class conditional distributions
estimated by kernel density estimation. Let $S_\pm = \{(x, \pm 1)\} \subseteq S$,
and take,

\[ \hat{P}(X = x \mid Y = \pm 1) \propto \frac{1}{|S_\pm|} \sum_{x' \in S_\pm} K(x,x') \]

and $\hat{P}(Y = 1) = \frac{|S_+|}{|S|}$. To classify new instances, use the Bayes optimal
classifier for $\hat{P}$. This yields the same classification rule as (2). This is the
"potential function rule" discussed in [16].

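The equivalence between the generative and discriminative views can be illustrated numerically. The sketch below assumes a Gaussian kernel and the empirical class prior $|S_+|/|S|$; the data, bandwidth and variable names are hypothetical. Weighting each class density estimate by its prior cancels the $1/|S_\pm|$ factor, which leaves exactly the mean score.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Kernel matrix with entries exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=+1.0, size=(30, 2))           # synthetic positive class
X_neg = rng.normal(loc=-1.0, size=(20, 2))           # synthetic negative class
X_train = np.vstack([X_pos, X_neg])
y_train = np.concatenate([np.ones(30), -np.ones(20)])
X_test = rng.normal(scale=2.0, size=(200, 2))

# Discriminative view: the mean score (1/n) sum_i y_i K(x_i, x).
mean_scores = gaussian_kernel(X_test, X_train) @ y_train / len(y_train)
mean_predictions = np.where(mean_scores > 0, 1.0, -1.0)

# Generative view: class prior times kernel density estimate, per class.
prior_pos = len(X_pos) / len(X_train)
joint_pos = prior_pos * gaussian_kernel(X_test, X_pos).mean(axis=1)
joint_neg = (1.0 - prior_pos) * gaussian_kernel(X_test, X_neg).mean(axis=1)
kde_predictions = np.where(joint_pos > joint_neg, 1.0, -1.0)
```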

3.4 Extension to Multiple Kernels 

To ensure the practical success of any kernel based method, it is important
that the correct feature map be chosen. Thus far we have only considered
the problem of learning with a single feature map, and not the problem
of learning the feature map. Given $k$ feature maps $\phi_i : X \to \mathcal{H}_i$, $i \in [1;k]$,
multiple kernel learning [25, 11] considers learning over a function class
that is the convex hull of the classes $\mathcal{F}_{\phi_i}^1$,

\[ \mathcal{F} := \Big\{ f(x) = \sum_{i=1}^{k} \alpha_i \langle \omega_i, \phi_i(x) \rangle : \|\omega_i\| \le 1,\ \alpha_i \ge 0,\ \sum_{i=1}^{k} \alpha_i = 1 \Big\}. \]

By an easy calculation,

\[ \min_{f \in \mathcal{F}} \frac{1}{|S|} \sum_{(x,y) \in S} 1 - y f(x) = \min_{i \in [1;k]} 1 - \|\omega_S^i\|, \]

where $\omega_S^i$ is the sample mean in the $i$-th feature space. In words, we pick
the feature space which minimizes $1 - \|\omega_S^i\|$. This is in contrast to usual
multiple kernel learning techniques that do not in general pick out a single
feature map. Furthermore, we have the following generalization bound.

Theorem 5. For all distributions $P$ and for all finite collections of bounded feature
maps $\phi_i : X \to \mathcal{H}_i$, $i \in [1;k]$, with probability at least $1 - \delta$ on a draw $S \sim P^n$,


^linear — ^linear ( S/ COg 



(1 + logW+logQ)) 


where $\hat{\omega}_S$ corresponds to the mean that minimizes $1 - \|\omega_S^i\|$.

Like Theorem 3, this is a specific case of a more refined bound presented
in the appendix to this chapter.
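The selection rule for multiple kernels, picking the feature space that minimizes $1 - \|\omega_S^i\|$, only needs the self-similarity of the sample under each candidate kernel. The following is a minimal sketch, assuming Gaussian kernels with a hypothetical grid of bandwidths as the candidate feature maps; the data and function names are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth):
    """Kernel matrix with entries exp(-||A[i] - B[j]||^2 / (2 * bandwidth^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))

def mean_norm(K, y):
    """||omega_S|| = sqrt((1/n^2) sum_{i,j} y_i y_j K(x_i, x_j))."""
    n = len(y)
    return float(np.sqrt(max(y @ K @ y, 0.0)) / n)

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.0, 1.0, size=(40, 2)),
               rng.normal(-1.0, 1.0, size=(40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])

bandwidths = [0.1, 1.0, 10.0]                        # hypothetical candidates
norms = [mean_norm(gaussian_kernel(X, X, b), y) for b in bandwidths]
linear_losses = [1.0 - nrm for nrm in norms]         # minimum linear loss per kernel
best_index = int(np.argmin(linear_losses))
best_bandwidth = bandwidths[best_index]
```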


4 The Robustness of the Mean Classifier 

Here we detail the robustness of the mean classifier to perturbations of
$P$. We do not consider the statistical issues of learning from a corrupted
distribution. We first show that the degree to which one can approximate
a classifier without loss of performance is related to the margin for error of
the classifier. We then discuss robustness properties of the mean classifier
under the $\sigma$-contamination model of Huber [24]. Finally we show the immunity
of the mean classifier to symmetric label noise.

The results of this section only pertain to linear function classes. In the 
following section we consider general function classes. We show that in 
this more general setting, linear loss is the only loss function that is robust 
to the effects of symmetric label noise. 

4.1 Approximation Error and Margins 

Define the margin loss at margin $\gamma$ to be $\ell_\gamma(y,v) = [\![ yv \le \gamma ]\!]$. The margin
loss is an upper bound of misclassification loss. For $\gamma = 0$, $\ell_\gamma = \ell_{01}$. The
margin loss is used in place of misclassification loss to produce tighter
generalization bounds for minimizing misclassification loss [20, 23, 37].
For a classifier $f$ to have small margin loss it must not just accurately
predict the label, it must do so with confidence. Maximizing the margin
while forcing $\ell_\gamma(S,\omega) = 0$ is the original motivation for the hard margin
SVM [12]. Here we relate the margin loss of a classifier $f$ to the amount of
slop allowed in approximating $f$.

Theorem 6 (Margins and Approximation). $\ell_\epsilon(P,f) \le \alpha$ if and only if $\ell_{01}(P,f') \le
\alpha$ for all $f'$ with $\|f - f'\|_\infty \le \epsilon$.

The margin for error on a distribution $P$ of a classifier $f$ is given by,

\[ \Gamma(P,f) := \sup \{ \gamma : \ell_\gamma(P,f) = \ell_{01}(P,f) \}. \]

For a sample $S$, setting $\epsilon < \Gamma(S,\omega_S)$ ensures,

\[ \ell_{01}(S,\omega_S) = \ell_\epsilon(S,\omega_S) = \ell_{01}(S,\tilde{\omega}_S), \]

for an approximation $\tilde{\omega}_S$ with $\|\omega_S - \tilde{\omega}_S\| \le \epsilon$.
The margin therefore provides a means of assessing the degree to which
one can approximate a classifier; the larger the margin, the greater the error
allowed in approximating the classifier.

4.2 Robustness under σ-contamination

Rather than samples from $P$, we assume the decision maker has access to
samples from a perturbed distribution,

\[ \tilde{P} = (1 - \sigma) P + \sigma Q, \]

with $Q$ the perturbation or corruption. We can view sampling from $\tilde{P}$ as
sampling from $P$ with probability $1 - \sigma$ and from $Q$ with probability $\sigma$. It
is easy to show that $\omega_{\tilde{P}} = (1 - \sigma)\omega_P + \sigma\omega_Q$. Furthermore,

\[ \|\omega_{\tilde{P}} - \omega_P\| = \sigma \|\omega_P - \omega_Q\|. \]

Lemma 7. If $\sigma \|\omega_P - \omega_Q\| < \Gamma(P,\omega_P)$ then $\ell_{01}(P,\omega_P) = \ell_{01}(P,\omega_{\tilde{P}})$.

Hence the margin provides a means to assess the immunity of the mean
classifier to corruption. Furthermore, as $\|\omega_P - \omega_Q\| \le 2$, if $\sigma < \frac{\Gamma(P,\omega_P)}{2}$
then the mean classifier is immune to the effects of any $Q$. We caution the
reader that Lemma 7 is a one way implication. For particular choices of $Q$,
one can show greater robustness of the mean classifier.

4.3 Learning Under Symmetric Label Noise 

Here we consider the problem of learning under symmetric label noise.
Rather than samples from $P$, the decision maker has access to samples
from a corrupted distribution $P_\sigma$. To sample from $P_\sigma$, first draw $(x,y) \sim P$
and then flip the label with probability $\sigma$. This problem is of practical
interest, particularly in situations where there are multiple labellers, each
of which can be viewed as an "expert" labeller with added noise. We can
decompose,

\[ P_\sigma = (1 - \sigma) P + \sigma P', \]

where $P'$ is the "label flipped" version of $P$. It is easy to show $\omega_{P'} = -\omega_P$.
Therefore $\omega_{P_\sigma} = (1 - 2\sigma)\omega_P$.

Theorem 8 (Symmetric Label Noise Immunity of the Mean Classifier). Let
$P_\sigma$ be $P$ corrupted via symmetric label noise with label flip probability $\sigma$. Then for
all $\sigma \in (0, \frac{1}{2})$, $\ell_{01}(P,\omega_P) = \ell_{01}(P,\omega_{P_\sigma})$.

The proof comes from the simple observation that as $\omega_P$ and $\omega_{P_\sigma}$ are
related by a positive constant, they produce the same classifier. This result
greatly extends previous results in [36, 28] on the symmetric label noise
immunity of the mean classification algorithm, where it is assumed the
marginal distribution over instances is uniform on the unit sphere in $\mathbb{R}^n$.
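The identity $\omega_{P_\sigma} = (1 - 2\sigma)\omega_P$ can be verified exactly on an empirical distribution by writing $P_\sigma$ as the mixture $(1 - \sigma)P + \sigma P'$. This is a minimal sketch under assumed settings: the explicit feature map $\phi(x) = x/\|x\|$, the synthetic data and the helper `mean_vector` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
Phi = X / np.linalg.norm(X, axis=1, keepdims=True)   # phi(x), ||phi(x)|| = 1
y = np.where(Phi[:, 0] + 0.2 * rng.normal(size=n) > 0, 1.0, -1.0)

def mean_vector(Phi, y, weights):
    """omega = sum_i weights[i] * y[i] * phi(x_i) for a weighted sample."""
    return (weights[:, None] * y[:, None] * Phi).sum(axis=0)

sigma = 0.3
uniform = np.full(n, 1.0 / n)
omega_clean = mean_vector(Phi, y, uniform)

# P_sigma = (1 - sigma) P + sigma P', with P' the label flipped copy of P.
Phi_mix = np.vstack([Phi, Phi])
y_mix = np.concatenate([y, -y])
w_mix = np.concatenate([(1.0 - sigma) * uniform, sigma * uniform])
omega_noisy = mean_vector(Phi_mix, y_mix, w_mix)

# Both mean vectors should predict the same labels on fresh instances.
X_test = rng.normal(size=(500, d))
Phi_test = X_test / np.linalg.norm(X_test, axis=1, keepdims=True)
scores_clean = Phi_test @ omega_clean
scores_noisy = Phi_test @ omega_noisy
```

With a finite noisy sample rather than the exact mixture, the identity holds only in expectation, which is a statistical question the text sets aside.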

4.4 Other Approaches to Learning with Symmetric Label Noise 

A large class of modern classification algorithms, such as logistic regression,
the SVM and boosting, proceed by minimizing a convex potential or
margin loss over a particular function class.

Definition 9. A loss $\ell$ is a convex potential if there exists a convex function
$\psi : \mathbb{R} \to \mathbb{R}$ with $\psi(v) \ge 0$, $\psi'(0) < 0$ and $\lim_{v \to \infty} \psi(v) = 0$, with,

\[ \ell(y,v) = \psi(yv). \]

Long and Servedio in [29] proved the following negative result on what
is possible when learning under symmetric label noise: there exists a separable
distribution $P$ and function class $\mathcal{F}$ where, when the decision maker
observes samples from $P_\sigma$ with symmetric label noise of any nonzero rate,
minimisation of any convex potential over $\mathcal{F}$ results in classification performance
on $P$ that is equivalent to random guessing. The example provided
in [29] is far from esoteric; in fact it is given by a distribution in $\mathbb{R}^2$ that
is concentrated on three points, with function class given by linear hyperplanes
through the origin. We present their example in Section 9.

Ostensibly, this result establishes that convex losses are not robust to symmetric
label noise, and motivates using non-convex losses [40, 31, 17, 30].
These approaches are computationally intensive and scale poorly to
large data sets. We have seen in the previous section that linear loss, with function
class $\mathcal{F}_\phi$ (for any feature map $\phi$), is immune to symmetric label noise.
Furthermore, minimizing linear loss is easy. We show in the following
section that linear loss minimization over any function class is immune to
symmetric label noise.

An alternate means of circumventing the impossibility result in [29] is
to use a rich function class, say by using a universal kernel, together with
a standard classification calibrated loss. While this approach is immune
to label noise, performing the minimization is difficult. By Theorem 1, for
sufficiently rich function classes, using any of these other losses will produce
the same result as using linear loss.

Finally, if the noise rate is known, one can use the method of unbiased
estimators presented by Natarajan et al. [33] and correct for the corruption.
The obvious drawback is that, in general, the noise rate is unknown. In the
following section we explore the relationship between linear loss and the
method of unbiased estimators. We show that linear loss is "unaffected"
by this correction (in a sense to be made precise). Furthermore, linear loss
is essentially the only convex loss with this property.


5 Symmetric Label Noise and Corruption Corrected Losses


The weakness of the analysis of section 4.3 was that it only considered
linear function classes. Here we show that linear loss minimization over
general function classes is unaffected by symmetric label noise, in the sense
that for all $\sigma \in (0, \frac{1}{2})$ and for all function classes $\mathcal{F} \subseteq \mathbb{R}^X$,

\[ \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P}\, \ell_{\mathrm{linear}}(y,f(x)) = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P_\sigma}\, \ell_{\mathrm{linear}}(y,f(x)). \]

For the following section we work directly with distributions $Q \in \mathcal{P}(\mathbb{R} \times
Y)$ over score, label pairs. Any distribution $P$ and classifier $f$ induces a
distribution $Q(P,f)$ with,

\[ \mathbb{E}_{(v,y) \sim Q(P,f)}\, \ell(y,v) = \mathbb{E}_{(x,y) \sim P}\, \ell(y,f(x)). \]

A loss $\ell$ provides a means to order distributions. For two distributions $Q, Q'$,
we say $Q \le_\ell Q'$ if,

\[ \mathbb{E}_{(v,y) \sim Q}\, \ell(y,v) \le \mathbb{E}_{(v,y) \sim Q'}\, \ell(y,v). \]

If $Q = Q(P,f_1)$ and $Q' = Q(P,f_2)$, the above is equivalent to,

\[ \mathbb{E}_{(x,y) \sim P}\, \ell(y,f_1(x)) \le \mathbb{E}_{(x,y) \sim P}\, \ell(y,f_2(x)), \]

i.e., the classifier $f_1$ has lower expected loss than $f_2$. The decision maker
wants to find the distribution $Q$, in some restricted set, that is smallest in
the ordering $\le_\ell$. Denote by $Q_\sigma$ the distribution obtained from drawing
pairs $(v,y) \sim Q$ and then flipping the label with probability $\sigma$. In light of
Long and Servedio's example, there is no guarantee that,

\[ Q \le_\ell Q' \Leftrightarrow Q_\sigma \le_\ell Q'_\sigma. \]

The noise might affect how distributions are ordered. To progress we seek
loss functions that are robust to label noise.

Definition 10. A loss $\ell$ is robust to label noise if for all $\sigma \in (0, \frac{1}{2})$,

\[ Q \le_\ell Q' \Leftrightarrow Q_\sigma \le_\ell Q'_\sigma. \]

In words, the decision maker correctly orders distributions if they assume
no noise. Robustness to label noise easily implies,

\[ \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P}\, \ell(y,f(x)) = \operatorname*{argmin}_{f \in \mathcal{F}} \mathbb{E}_{(x,y) \sim P_\sigma}\, \ell(y,f(x)), \]

for all $\mathcal{F}$. Given any $\sigma \in (0, \frac{1}{2})$, Natarajan et al. showed in [33] how to
correct for the corruption by associating with any loss a corrected loss,

\[ \ell_\sigma(y,v) = \frac{(1 - \sigma)\,\ell(y,v) - \sigma\,\ell(-y,v)}{1 - 2\sigma}, \]

with the property,

\[ \mathbb{E}_{(v,y) \sim Q}\, \ell(y,v) = \mathbb{E}_{(v,y) \sim Q_\sigma}\, \ell_\sigma(y,v), \quad \forall Q \in \mathcal{P}(\mathbb{R} \times Y). \]

Robustness to label noise can be characterized by the order equivalence of
$\ell$ and $\ell_\sigma$.
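The unbiasedness property of the corrected loss can be checked directly on a finite distribution over score, label pairs. The sketch below uses the hinge loss purely as an example base loss; the distribution `q`, the helper `corrected` and the noise rate are hypothetical choices for illustration.

```python
import numpy as np

def hinge(y, v):
    """Hinge loss, used here only as an example base loss."""
    return np.maximum(0.0, 1.0 - y * v)

def corrected(loss, sigma):
    """Corrected loss ((1 - sigma) l(y, v) - sigma l(-y, v)) / (1 - 2 sigma)."""
    def loss_sigma(y, v):
        return ((1.0 - sigma) * loss(y, v) - sigma * loss(-y, v)) / (1.0 - 2.0 * sigma)
    return loss_sigma

rng = np.random.default_rng(4)
m = 6
v = rng.normal(size=m)                       # scores
y = rng.choice([-1.0, 1.0], size=m)          # clean labels
q = rng.dirichlet(np.ones(m))                # Q puts mass q[i] on (v[i], y[i])

sigma = 0.2
loss_sigma = corrected(hinge, sigma)

# Expected base loss under the clean distribution Q.
clean_risk = np.sum(q * hinge(y, v))

# Expected corrected loss under Q_sigma: each label is kept with
# probability 1 - sigma and flipped with probability sigma.
noisy_corrected_risk = np.sum(
    q * ((1.0 - sigma) * loss_sigma(y, v) + sigma * loss_sigma(-y, v))
)
```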

Definition 11 (Order Equivalence). Two loss functions $\ell_1$ and $\ell_2$ are order
equivalent if for all distributions $Q, Q' \in \mathcal{P}(\mathbb{R} \times Y)$,

\[ Q \le_{\ell_1} Q' \Leftrightarrow Q \le_{\ell_2} Q'. \]

Lemma 12. A loss $\ell$ is robust to label noise if and only if for all $\sigma \in (0, \frac{1}{2})$, $\ell$ and $\ell_\sigma$
are order equivalent.

In words, the decision maker correctly orders distributions if they incorrectly
assume noise. Following on from these insights, we now characterize
when a loss is robust to label noise.

Theorem 13 (Characterization of Robustness). Let $\ell$ be a loss with $\ell(-1,v) \ne
\ell(1,v)$. Then $\ell$ is robust to label noise if and only if there exists a constant $C$ such
that,

\[ \ell(1,v) + \ell(-1,v) = C, \quad \forall v \in \mathbb{R}. \]

Ghosh et al. in [18] prove a one way result. Misclassification loss
satisfies the conditions for Theorem 13, however it is difficult to minimize
directly. For linear loss,

\[ \ell(1,v) + \ell(-1,v) = 1 - v + 1 + v = 2. \]

Therefore linear loss is robust to label noise. Furthermore, up to equivalence,
linear loss is the only convex function that satisfies the condition of Theorem 13.

Theorem 14 (Uniqueness of Linear Loss). A loss $\ell$ is convex in its second
argument and is robust to label noise if and only if there exists a constant $\lambda$ and
a function $g : Y \to \mathbb{R}$ such that,

\[ \ell(y,v) = \lambda y v + g(y). \]

5.1 Beyond Symmetric Label Noise

Thus far we have assumed that the noise on positive and negative labels
is the same. A sensible generalization is label conditional noise, where the
label $y \in \{-1,1\}$ is flipped with a label dependent probability $\sigma_y$. Following
Natarajan et al. [33], we can correct for class conditional label noise in the
same way we can correct for symmetric label noise, and use the loss,

\[ \ell_{\sigma_{-1},\sigma_1}(y,v) = \frac{(1 - \sigma_{-y})\,\ell(y,v) - \sigma_y\,\ell(-y,v)}{1 - \sigma_{-1} - \sigma_1}. \]

If the decision maker knows the ratio $\frac{\sigma_{-1}}{\sigma_1}$, then for a certain class of
loss functions they can avoid estimating noise rates.

Theorem 15. Let $\ell$ be a loss with $\sigma_1 \ell(-1,v) + \sigma_{-1} \ell(1,v) = C$ for all $v \in \mathbb{R}$.
Then $\ell_{\sigma_{-1},\sigma_1}$ and $\ell$ are order equivalent.

For linear loss,

\[ \sigma_1 (1 + v) + \sigma_{-1} (1 - v) = \sigma_1 + \sigma_{-1} + (\sigma_1 - \sigma_{-1}) v, \]

which is not constant in $v$ unless $\sigma_1 = \sigma_{-1}$. Linear loss (and similarly misclassification
loss) is no longer robust under label conditional noise. This
result also means there is no non-trivial loss that is robust to label conditional
noise for all noise rates $\sigma_{-1} + \sigma_1 < 1$, as linear loss would be a
candidate for such a loss.

Progress can be made if one works with more general error measures,
beyond expected loss. For a distribution $P \in \mathcal{P}(X \times Y)$, let $P_\pm \in \mathcal{P}(X)$
be the conditional distribution over instances given a positive or negative
label respectively. The balanced error function is defined as,

\[ \mathrm{BER}_\ell(P_+,P_-,f) = \frac{1}{2} \mathbb{E}_{x \sim P_+} \ell(1,f(x)) + \frac{1}{2} \mathbb{E}_{x \sim P_-} \ell(-1,f(x)). \]

If both labels are equally likely under $P$, then the balanced error is exactly
the expected loss. The balanced error "balances" the two classes, treating errors
on positive and negative labels equally. Closely related to the problem
of learning under label conditional noise is the problem of learning under
mutually contaminated distributions, presented in Menon et al. [32].
Rather than samples from the clean label conditional distributions, the decision
maker has access to samples from corrupted distributions $\tilde{P}_\pm$, with,

\[ \tilde{P}_+ = (1 - \alpha) P_+ + \alpha P_- \quad \text{and} \quad \tilde{P}_- = \beta P_+ + (1 - \beta) P_-, \quad \alpha + \beta < 1. \]

In words, the corrupted $\tilde{P}_y$ is a combination of the true $P_y$ and the unwanted
$P_{-y}$. We warn the reader that $\alpha$ and $\beta$ are not the noise rates on
the two classes. However, in section 2.3 of Menon et al. [32], they are
shown to be related by an invertible transformation.

Theorem 16. Let $\ell$ be robust to label noise. Then,

\[ \mathrm{BER}_\ell(\tilde{P}_+,\tilde{P}_-,f) = (1 - \alpha - \beta)\, \mathrm{BER}_\ell(P_+,P_-,f) + C, \]

for some constant $C$.

This is a generalization of proposition 1 of Menon et al. [32], which
restricts to misclassification loss. Taking argmins yields,

\[ \operatorname*{argmin}_{f \in \mathcal{F}} \mathrm{BER}_\ell(\tilde{P}_+,\tilde{P}_-,f) = \operatorname*{argmin}_{f \in \mathcal{F}} \mathrm{BER}_\ell(P_+,P_-,f). \]

Thus balanced error can be optimized from corrupted distributions.

Going further beyond symmetric label noise, one can assume a general
noise process with noise rates that depend both on the label and the observed
instance. Define the noise function $\sigma : X \times Y \to [0, \frac{1}{2})$, with $\sigma(x,y)$
the probability that the instance, label pair $(x,y)$ has its label flipped.
Rather than samples from $P$, the decision maker has samples from $P_\sigma$,
where to sample from $P_\sigma$ first sample $(x,y) \sim P$ and then flip the label with
probability $\sigma(x,y)$. The recent work of Ghosh et al. [18] proves the following
theorem concerning the robustness properties of minimizing any loss
that is robust to label noise.

Theorem 17. For all distributions $P$, function classes $\mathcal{F}$, all noise functions
$\sigma : X \times Y \to [0, \frac{1}{2})$ and all loss functions $\ell$ that are robust to label noise,

\[ \ell(P,f^*_\sigma) \le \frac{\ell(P,f^*)}{\min_{(x,y)} 1 - 2\sigma(x,y)}, \]

where $f^*_\sigma$ and $f^*$ are the minimizers over $\mathcal{F}$ of $\ell(P_\sigma,f)$ and $\ell(P,f)$ respectively.

Our proof of this theorem is a slight modification of the discussion
that follows remark 1 in Ghosh et al. [18]. There they only consider
variable noise rates that are functions of the instance. We include it for
completeness. In particular, this theorem shows that if $\ell(P,f^*) = 0$ and
$\max_{(x,y)} \sigma(x,y) < \frac{1}{2}$ then minimizing $\ell$ with samples from $P_\sigma$ will also
recover a classifier with zero loss against the clean $P$.

6 Herding for Sparse Approximation 

The main problem with classifying according to (2) is the dependence of the classifier
on the entire sample. We show how to correct this. We first survey
the technique of herding, before showing how it can be applied in estimating
the classifier of (2).

Data: Distribution $P \in \mathcal{P}(Z)$, set of possible representative points
    $S \subseteq Z$, kernel function $\mathcal{K}$ and error tolerance $\epsilon$.
Result: Weighted set of representatives $H = \{(\alpha_i, z_i)\}_{i=1}^{m}$ such that
    $\big\| \omega_P - \sum_{(\alpha,z) \in H} \alpha\,\psi(z) \big\| \le \epsilon$.
Initialization: $z^* = \operatorname*{argmax}_{z' \in S} \mathbb{E}_{z \sim P} \mathcal{K}(z,z')$, $H = \{(1,z^*)\}$;
while $\big\| \omega_P - \sum_{(\alpha,z) \in H} \alpha\,\psi(z) \big\| > \epsilon$ do
    Let $z^* = \operatorname*{argmax}_{z' \in S} \mathbb{E}_{z \sim P} \mathcal{K}(z,z') - \sum_{(\alpha,z) \in H} \alpha\,\mathcal{K}(z,z')$;
    Set $\lambda^* = \operatorname*{argmin}_{\lambda \in [0,1]} \big\| \omega_P - \big( (1 - \lambda) \sum_{(\alpha,z) \in H} \alpha\,\psi(z) + \lambda\,\psi(z^*) \big) \big\|$;
    Multiply all weights in $H$ by $1 - \lambda^*$;
    Add $(\lambda^*, z^*)$ to $H$
end

Algorithm 1: Pseudo-code specification of Herding.

For any set $Z$, mapping $\psi : Z \to \mathcal{H}$ and distribution $P \in \mathcal{P}(Z)$, define
the mean $\omega_P = \mathbb{E}_{z \sim P} \psi(z)$. We recover our previous definition by taking
$Z = X \times Y$ and $\psi(x,y) = y\,\phi(x)$. Given a set of examples $S = \{z_i\}_{i=1}^{n}$,
herding [44, 3] is a method to sparsely approximate $\omega_P$ with a combination
of the images $\psi(z_i)$ of the elements of $S$. In [3] it was shown that herding is an application
of the Frank-Wolfe optimization algorithm to the convex problem,

\[ \min_{\omega \in C} \|\omega_P - \omega\|^2, \]

where $C = \mathrm{co}(\{\psi(z) : z \in S\})$. Define the kernel $\mathcal{K}(z,z') = \langle \psi(z), \psi(z') \rangle$.
Herding proceeds as in Algorithm 1. Intuitively, herding begins by selecting
the point in $S$ that is most similar on average to draws from $P$, as
measured by $\mathcal{K}$. When selecting a new representative, herding chooses
the point in $S$ that is most similar on average to draws from $P$ while being
different from previously chosen points. If herding runs for $m$ iterations, then
an approximation of $\omega_P$ with only $m$ elements is obtained. One can also
take $\lambda^* = \frac{1}{|H| + 1}$, leading to uniform weights.

Herding can also be viewed as minimizing $\mathrm{MMD}_\psi(P,Q)$, where the approximating
distribution $Q$ is concentrated on $S$ [3]. Originally, herding
was motivated as a means to produce "super samples" from a distribution
$P$. Standard Monte Carlo techniques lead to convergence at rate $O(1/m)$ of
the square error $\|\omega_P - \tilde{\omega}\|^2 \to 0$. Using herding, under certain conditions
faster rates can be achieved. For our application, $P$ is the empirical distribution
over the set $S$, $\frac{1}{|S|} \sum_{z \in S} \delta_z$, or equivalently $\omega_S = \frac{1}{|S|} \sum_{z \in S} \psi(z)$.
As we will see, herding converges rapidly: $O(\log(\frac{1}{\epsilon}))$ iterations gives an
approximation of accuracy $\epsilon$.

The expression for the optimal $\lambda^*$ is available in closed form, see section
4 of Bach et al. [3]. More exotic forms of the Frank-Wolfe algorithm exist.
In fully corrective methods, the line search over $\lambda$ is replaced with a full
optimization over all current points in the herd [26]. Away step methods
consider both adding a new member to the herd as well as deleting a current
member [22]. These more involved methods can be used in place of
Algorithm 1.
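Herding with line search can be sketched using only the kernel matrix. The code below is a minimal illustration, not the authors' implementation: it assumes $P$ is the empirical distribution over $S$, and the closed form step is obtained here by minimizing the quadratic $\lambda \mapsto \|\omega_P - ((1 - \lambda)\tilde{\omega} + \lambda\psi(z^*))\|^2$ and clipping to $[0,1]$. The function name `herd`, the iteration cap and the Gaussian kernel test data are hypothetical.

```python
import numpy as np

def herd(K, tol, max_iter=5000):
    """Herd the empirical distribution over n points with kernel matrix K.

    Returns a weight vector a (nonnegative, summing to one) and the final
    error ||omega_P - sum_j a[j] psi(z_j)||.
    """
    n = K.shape[0]
    mu = K.mean(axis=0)              # mu[j] = E_{z~P} K(z, z_j), i.e. <omega_P, psi(z_j)>
    norm_sq = K.mean()               # ||omega_P||^2
    a = np.zeros(n)
    a[int(np.argmax(mu))] = 1.0      # start from the most representative point

    def error(a):
        return float(np.sqrt(max(norm_sq - 2.0 * a @ mu + a @ K @ a, 0.0)))

    for _ in range(max_iter):
        if error(a) <= tol:
            break
        Ka = K @ a                   # <omega_tilde, psi(z_j)> for every j
        aKa = a @ Ka                 # ||omega_tilde||^2
        j = int(np.argmax(mu - Ka))  # similar to P, dissimilar to the current herd
        # Exact line search along omega_tilde -> psi(z_j), clipped to [0, 1].
        numer = mu[j] - a @ mu - Ka[j] + aKa
        denom = K[j, j] - 2.0 * Ka[j] + aKa
        if denom <= 0.0:
            break
        lam = float(np.clip(numer / denom, 0.0, 1.0))
        a *= 1.0 - lam               # shrink existing weights
        a[j] += lam                  # add (or reinforce) the new representative
    return a, error(a)

# Illustrative data: Gaussian kernel on 60 random points in the plane.
rng = np.random.default_rng(6)
Z = rng.normal(size=(60, 2))
sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / 2.0)

tol = 0.1
weights, final_error = herd(K, tol)
herd_size = int(np.count_nonzero(weights))   # number of representatives kept
```

If a point is selected twice, this sketch merges the two entries by adding the weights, which represents the same approximation as keeping both pairs in $H$.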

6.1 Rates of Convergence for Herding 

Let $\tilde{\omega}_m$ be the approximation to $\omega_P$ obtained from running herding for $m$
iterations. As discussed previously, herding can be used as a means of
sampling from a distribution, with rates of convergence $\|\omega_P - \tilde{\omega}_m\| \to 0$
faster than that for random sampling. While in the worst case one cannot
do better than a $\frac{1}{\sqrt{m}}$ rate, if $\omega_P \in C$ faster rates can be obtained [3]. Let $D$
be the diameter of $C$ and $d$ the distance from $\omega_P$ to the boundary of $C$. For
herding with line search as in Algorithm 1,

\[ \|\omega_P - \tilde{\omega}_m\| \le \|\omega_P - \tilde{\omega}_1\|\, e^{-\kappa m}, \]

where $\kappa$ is a positive constant depending on $d$ and $D$ [3]. For our application
$\omega_P = \omega_S = \frac{1}{|S|} \sum_{z \in S} \psi(z)$,
which is clearly in $C$; furthermore $d > 0$. Hence the herded approximation
converges quickly to $\omega_S$.

6.2 Computational Analysis of Herding 

The main bottleneck of the herding algorithm is the population of the kernel
matrix, which runs in time of order $n^2$. Like most greedy algorithms,
to calculate each iteration of the herding algorithm, only knowledge of
the previously added point is required. Therefore, each iteration runs in
order $n$. One can avoid calculating the entire kernel matrix by estimating
$\mathbb{E}_{z \sim P} \mathcal{K}(z,z')$. This reduces the initialization time to order $n$, at the cost of
extra time per iteration required to calculate the kernel between the newly
added point and all the elements of the sample.

There exist many tricks to speed up the training of SVMs [42]. In Section
6.6 we show how these methods can be applied to herding.


6.3 Parallel Extension 


It is very easy to parallelize the herding algorithm. Rewriting the mean as
a "mean of means", one has,

\[ \frac{1}{n} \sum_{i=1}^{n} \psi(z_i) = \sum_{i=1}^{m} \frac{n_i}{n} \Big( \frac{1}{n_i} \sum_{j=1}^{n_i} \psi(z_{ij}) \Big), \]

where we have split the $n$ data points into $m$ disjoint groups, with $n_i$ the size
of the $i$-th group and $z_{ij}$ the $j$-th element of the $i$-th group. We can use herding
to approximate each sub mean $\frac{1}{n_i} \sum_{j=1}^{n_i} \psi(z_{ij})$ separately. Furthermore, if we approximate each
sub mean to tolerance $\epsilon$, combining the approximations yields an approximation
to the total mean with tolerance $\epsilon$.

Lemma 18 (Parallel Means). Let $\omega = \sum_i \lambda_i \omega_i$ with $\lambda_i \ge 0$ and $\sum_i \lambda_i = 1$.
Suppose that for each $i$ there is an approximation $\tilde{\omega}_i$ with $\|\omega_i - \tilde{\omega}_i\| \le \epsilon$. Then
$\|\omega - \sum_i \lambda_i \tilde{\omega}_i\| \le \epsilon$.

The proof is a simple application of the triangle inequality and the
homogeneity of norms. Lemma 18 allows one to use a map-reduce algorithm
to herd large sets of data. One splits the data into $m$ groups, herds
each group in parallel and then combines the groups, possibly herding the
result.
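Lemma 18 is easy to confirm numerically. The sketch below stands in for the herding step with arbitrary perturbations of norm at most $\epsilon$ (an assumption made to keep the example short); the group sizes and vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d, eps = 4, 10, 0.05
group_sizes = np.array([10, 20, 30, 40])
lam = group_sizes / group_sizes.sum()        # weights n_i / n, summing to one

# Exact sub means omega_i (random stand-ins for this illustration).
omegas = rng.normal(size=(m, d))

# Approximations within eps of each sub mean, as herding to tolerance eps would give.
noise = rng.normal(size=(m, d))
noise *= eps * rng.uniform(size=(m, 1)) / np.linalg.norm(noise, axis=1, keepdims=True)
approximations = omegas + noise

omega = lam @ omegas                         # total mean
combined = lam @ approximations              # combination of the approximations
combined_error = float(np.linalg.norm(omega - combined))
sub_errors = np.linalg.norm(omegas - approximations, axis=1)
```

In a map-reduce setting the sub means would be herded on separate workers and only the weighted herds combined, so `lam` is the only global quantity needed.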


6.4 Discriminative Herding for Approximating Rule [2] 

Our goal is to approximate equation (j2j, which in turn means approximat¬ 
ing cos- To this end, we run herding on the sample S. Let ip : X x Y — > 
TL, with ip(x,y) = ytp(x) and corresponding kernel JC((x,y), (x',y')) = 
yy'K(x,x'). We take, 

&s = E 

{u,(x,y))eH 

where $H$ is the representative set (or herd) of instance, label pairs obtained from herding $S$ to tolerance $\epsilon$. Our approximate classifier is $\hat{f}(x) = \langle \hat{\omega}_S, \phi(x) \rangle$. We have by a simple application of the Cauchy-Schwarz inequality,

$$\|f - \hat{f}\|_\infty = \sup_{x} |\langle \omega_S - \hat{\omega}_S, \phi(x) \rangle| \leq \epsilon.$$

Hence the tolerance used in the herding algorithm directly controls the 
approximation accuracy. 
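
Written out, the Cauchy-Schwarz step reads as follows. This assumes the feature map is bounded, $\|\phi(x)\| \leq 1$ (the same assumption made in the generalization bounds of the appendix), and that herding has returned an approximation with $\|\omega_S - \hat{\omega}_S\| \leq \epsilon$. For any $x$:

```latex
|f(x) - \hat{f}(x)|
  = |\langle \omega_S - \hat{\omega}_S, \phi(x) \rangle|
  \leq \|\omega_S - \hat{\omega}_S\| \, \|\phi(x)\|
  \leq \epsilon .
```

Taking the supremum over $x$ gives the bound on $\|f - \hat{f}\|_\infty$.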

6.5 Comparisons with Previous Work

Herding has appeared under a different name in the field of statistics, in
the work of Jones [27]. There an algorithm closely related to the Frank-
Wolfe algorithm (projection pursuit) is considered, and rates of conver¬ 
gence of for the general case when cop £ C are proved. The appendix 
of [23] features a theoretical discussion of sparse approximations. Herbrich and Williamson in [23] show the existence of an $m$-sparse approximation with $\|\omega - \hat{\omega}\| \leq \frac{\sqrt{2}\,\epsilon_m(S)}{\sqrt{m}}$, with $\epsilon_m(S)$ the entropy numbers of the set $S$.
We further explore the connections to their approach in the appendix to 
this chapter. 

6.6 Comparing Herding to Sparse SVM Solvers 

Recall that the SVM solves the following convex objective, 

$$\arg\min_{\omega \in \mathcal{H}} \frac{1}{|S|} \sum_{(x,y) \in S} [1 - y\langle \omega, \phi(x) \rangle]_+ + \frac{\lambda}{2}\|\omega\|^2. \quad (3)$$

There are many approximate, "greedy", methods to attack this problem. These methods are deeply related to Frank-Wolfe algorithms [42, 1, 10]. Here we show the connection of these methods to kernel herding. It is well known that the optimal solution to the SVM objective (3) is of the form,

$$\omega = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i), \quad \alpha_i \geq 0 \ \forall i \in [1;n].$$

Let $C = \mathrm{co}(\{y\phi(x) : (x,y) \in S\})$. If we normalize the $\alpha_i$, i.e. take $\sum_{i=1}^{n} \alpha_i = 1$ (which does not change the outputted classifier), then $\omega \in C$. For all $\omega \in C$, $\|\omega\| \leq 1$. Therefore, via an application of the Cauchy-Schwarz inequality, $y\langle \omega, \phi(x) \rangle \leq 1$ for every $(x,y) \in S$, so the hinge is never clipped and the SVM objective (3) is equivalent to,

$$\arg\min_{\omega \in C} 1 - \langle \omega, \omega_S \rangle + \frac{\lambda}{2}\|\omega\|^2.$$

Setting $\lambda = 1$ gives optimal solution $\omega^* = \omega_S$. Furthermore, for $\lambda = 1$,

$$\underbrace{-\langle \omega, \omega_S \rangle + \frac{1}{2}\|\omega\|^2}_{\text{SVM objective}} = \frac{1}{2}\|\omega - \omega_S\|^2 - \underbrace{\frac{1}{2}\|\omega_S\|^2}_{\text{Independent of } \omega}.$$

Therefore the SVM objective (3) reduces to the herding objective,

$$\arg\min_{\omega \in C} \|\omega - \omega_S\|^2.$$

Herding can thus be understood as the application of "greedy" algorithms presented in [42, 1, 10] to a sufficiently regularized SVM objective.
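
To make the connection concrete, here is a minimal sketch of one Frank-Wolfe variant applied to the herding objective $\min_{\omega \in C} \|\omega - \omega_S\|^2$, written entirely in terms of the Gram matrix of the kernel $yy'K(x,x')$. It is an illustration under stated assumptions, not the herding algorithm as specified earlier in the paper: the exact line search, the choice of starting vertex, the Gaussian kernel parametrization $\exp(-\|x-x'\|^2 / 2b^2)$, the synthetic data, and the function names `gaussian_gram` and `frank_wolfe_herd` are all choices made for this example.

```python
import numpy as np

def gaussian_gram(X, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (parametrization assumed for this sketch)."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def frank_wolfe_herd(G, n_iter):
    """Greedy sparse approximation of the mean, given the Gram matrix G of the vertices.

    Returns convex weights a and the error ||sum_i a_i psi_i - omega_S||.
    """
    n = G.shape[0]
    u = np.full(n, 1.0 / n)            # weights representing the sample mean omega_S
    Gu = G @ u
    a = np.zeros(n)
    a[np.argmax(Gu)] = 1.0             # start at the vertex best aligned with omega_S
    for _ in range(n_iter):
        grad = G @ a - Gu              # grad[j] = <omega - omega_S, psi_j>
        j = int(np.argmin(grad))       # vertex giving the steepest descent direction
        d = -a.copy()
        d[j] += 1.0                    # direction from the current point to vertex j
        denom = d @ G @ d
        if denom <= 1e-15:             # already at that vertex, nothing to gain
            break
        step = np.clip(-(grad @ d) / denom, 0.0, 1.0)   # exact line search on [0, 1]
        a = a + step * d               # stays in the simplex: (1 - step) * a + step * e_j
    err = np.sqrt(max((a - u) @ G @ (a - u), 0.0))
    return a, err

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=300)
X = rng.normal(size=(300, 2)) + 1.5 * y[:, None]
G = (y[:, None] * y[None, :]) * gaussian_gram(X)    # kernel yy'K(x, x')

_, err_start = frank_wolfe_herd(G, 0)
weights, err_final = frank_wolfe_herd(G, 50)
print(err_start, err_final, np.count_nonzero(weights))
```

Each iteration adds at most one new point to the support, so after $t$ iterations the approximation uses at most $t + 1$ of the $n$ sample points, which is where the sparsity comes from.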

6.7 Sparsity Inducing Objectives versus Sparsity Inducing Algorithms

Much of practical machine learning can be understood as solving regularized empirical loss problems,

$$\arg\min_{\omega \in \mathcal{H}} \frac{1}{|S|} \sum_{(x,y) \in S} \ell(y, \langle \omega, \phi(x) \rangle) + \Omega(\omega),$$

with $\ell$ a loss and $\Omega$ a regularizer. It is desirable for the evaluation speed of the outputted classifier that $\omega$ be as sparse as possible. For example, the linear loss objective does not return a sparse solution. There are two main approaches to this problem.

One can study objectives that promote sparsity, via sparsity inducing losses or sparsity inducing regularizers. For example in the LASSO, the L1 regularizer $\Omega(\omega) = \lambda \sum_{i} |\omega_i|$ is used EH. Alternately, Bartlett and Tewari in [6] use the standard square norm regularizer, $\Omega(\omega) = \frac{\lambda}{2}\|\omega\|^2$, and vary the loss. They show there is an inherent trade-off between sparse solutions, and solutions that give calibrated probability estimates. We point out that this is for this particular choice of regularizer. In the objective based approach, properties of the actual minimizer are deduced from the KKT conditions of the relevant optimization objective.

In practice, one rarely if ever returns the exact minimizer. Therefore, the search for objectives that have sparse minimizers does not tell the full story. The approach taken here, and in [42, 1, 10], is to use an optimization algorithm that provides sparsity for free.

In the context of learning with symmetric label noise, this further highlights the importance of strong robustness. What is important is how the objective orders solutions, and not necessarily what the exact minimizer of the objective is.

7 Conclusion 

We have taken a simple classifier, given by the sample mean, and have placed it on a firm theoretical grounding. We have shown its relation to maximum mean discrepancy, highly regularized support vector machines and finally to kernel density estimation. We have proven a surrogate regret bound highlighting its usefulness in learning classifiers, as well as generalization bounds for single and multiple feature maps. We have analysed the robustness properties of the mean classifier, and have shown that linear loss is the only convex loss function that is robust to symmetric label noise. Finally, we have shown how herding can be used to speed up its evaluation. The result is a conceptually clear, theoretically justified means of learning classifiers.

Additional Material


8 Proof of Concept Experiment 

Here we include a proof of concept experiment, highlighting the performance of herding as a means of compressing data sets. Keeping up with the current fashion, we consider classifying 3's versus 8's from the MNIST data set, comprising 11982 training examples and 1984 test examples. We normalize all pixel values to lie in the interval [0,1] and use a Gaussian kernel with bandwidth 1. We plot the test set performance of the learned classifier as a function of the percentage of the training set used in the herd. To produce the dashed curve, we recursively herd with an allowed error of 0.01 (i.e. we herd the data set, and then the herd and so on). To produce the dotted curve, we recursively use parallel herding with an allowed error of 0.025 and a maximum number of 200 data points in each subdivision. Each large dot signifies a herd. For both curves, we recurse until there are only 100 data points in the herd. As a baseline (in red), we plot the performance of the mean of the entire training set.
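
The baseline in this experiment is the mean classifier itself. A scaled-down sketch of that baseline is given below. It does not use MNIST: the two-dimensional synthetic Gaussian blobs, the sample sizes, the kernel parametrization $\exp(-\|x-x'\|^2 / 2b^2)$ with $b = 1$, and the function names are all assumptions made so that the example is self-contained.

```python
import numpy as np

def gaussian_kernel(A, B, bandwidth=1.0):
    """Pairwise Gaussian kernel between rows of A and rows of B (parametrization assumed)."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth**2))

def mean_classifier(X_train, y_train, X_test, bandwidth=1.0):
    """Predict sign((1/n) * sum_i y_i K(x_i, x)) for each test point x."""
    K = gaussian_kernel(X_test, X_train, bandwidth)
    scores = K @ y_train / len(y_train)
    return np.where(scores >= 0, 1, -1)

rng = np.random.default_rng(0)

def sample(n):
    # Synthetic stand-in for the data: one Gaussian blob per class
    y = rng.choice([-1, 1], size=n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return X, y

X_train, y_train = sample(400)
X_test, y_test = sample(200)
accuracy = np.mean(mean_classifier(X_train, y_train, X_test) == y_test)
print(accuracy)
```

Evaluating this classifier costs one kernel evaluation per training point per test point, which is the cost that herding is meant to reduce by replacing the training set with a smaller weighted herd.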


[Figure 1 plot: "Herding MNIST". Legend: Baseline (Full Training Set); Herding (Parallel); Herding (Full Kernel Matrix). Horizontal axis: ratio of training set used in approximation.]

Figure 1: Experiment on the MNIST data set highlighting herding's ability to compress data sets. Curves are produced by recursively running the herding algorithm (herding the data and then the herd and so on), see text (best viewed in colour).

The baseline method achieves test set performance of 98.74%. Firstly, the curves for parallel and non-parallel herding are qualitatively the same. We comment on the non-parallel herding. We see that with as little as 1% of the training set, an accuracy of over 94% is obtained. The performance of the herded samples rapidly approaches that of the full mean. Less than 20% of the training set affords an accuracy of over 97%.

9 Long and Servedio Example 


Figure 2: Long and Servedio's example highlighting the non-robustness to label noise of hinge loss minimization. See text.

Figure 2 details Long and Servedio's example highlighting the non-robustness to label noise of hinge loss minimization. The distribution $P$ is concentrated on the blue points, with each point deterministically labelled positive. The southern most point is chosen with probability $\frac{1}{2}$ and the other two points are chosen with probability $\frac{1}{4}$ each. The function class considered is hyperplanes through the origin. Solving for,

$$\arg\min_{\omega \in \mathbb{R}^2} \mathbb{E}_{(x,y) \sim P} [1 - y\langle \omega, x \rangle]_+,$$

yields the solid black hyperplane, which correctly classifies all points. Solving for,

$$\arg\min_{\omega \in \mathbb{R}^2} \mathbb{E}_{(x,y) \sim P_\sigma} [1 - y\langle \omega, x \rangle]_+,$$

for sufficiently large $\sigma$, yields the dashed black hyperplane, which incorrectly classifies the southern most point. As this point is chosen with probability $\frac{1}{2}$, this classifier performs as well as random guessing. The scale of the data set can be chosen so that this occurs for $\sigma$ arbitrarily small. In contrast the mean solution provides the red hyperplane, which correctly classifies all data points.

10 Proof of Theorem 1


Proof. It is well known that $f_P \in \arg\min_{f \in [-1,1]^X} \ell_{01}(P,f)$. From $P$ define $P_X$ to be the marginal distribution over instances and $\eta(x) = P(Y = 1 | X = x)$. Then,

$$\ell_{\mathrm{linear}}(P,f) = \mathbb{E}_{(x,y) \sim P}\, 1 - yf(x) = \mathbb{E}_{x \sim P_X}\, 1 + (1 - 2\eta(x))f(x).$$

Minimizing over $f \in [-1,1]^X$ gives $f(x) = -1$ if $1 - 2\eta(x) > 0$ i.e. when $\eta(x) < \frac{1}{2}$ and $f(x) = 1$ otherwise. This proves the first claim. We have,

$$\ell_{\mathrm{linear}}(P,f_P) = \mathbb{E}_{x \sim P_X}\, 1 - |1 - 2\eta(x)|.$$

Therefore,

$$\ell_{\mathrm{linear}}(P,f) - \ell_{\mathrm{linear}}(P,f_P) = \mathbb{E}_{x \sim P_X}\, (1 - 2\eta(x))f(x) + |1 - 2\eta(x)|$$
$$= \mathbb{E}_{x \sim P_X}\, |1 - 2\eta(x)| - \mathrm{sign}(2\eta(x) - 1)\,|1 - 2\eta(x)|\,f(x)$$
$$= \mathbb{E}_{x \sim P_X}\, |1 - 2\eta(x)|\,(1 - \mathrm{sign}(2\eta(x) - 1)f(x)).$$

It is well known [34] that,

$$\ell_{01}(P,f) - \ell_{01}(P,f_P) = \mathbb{E}_{x \sim P_X}\, |1 - 2\eta(x)|\,[\![\mathrm{sign}(2\eta(x) - 1)f(x) \leq 0]\!].$$

We complete the proof by noting $[\![v \leq 0]\!] \leq 1 - v$ for $v \in [-1,1]$.

□


11 PAC-Bayesian Bounds for Linear Loss 

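Here we develop general bounds for learning with linear loss. Theorems 3 and 5 are recovered as special cases. For the following, $\ell$ will denote linear loss.

Let $\mathcal{F} \subseteq \mathbb{R}^X$. Denote the expected linear loss of $f \in \mathcal{F}$ by $\ell(P,f)$. We consider randomized algorithms $\mathcal{A} : \cup_{n=1}^{\infty} (X \times Y)^n \to \mathcal{P}(\mathcal{F})$. For any algorithm $\mathcal{A}$, define the mean function $\bar{\mathcal{A}} : \cup_{n=1}^{\infty} (X \times Y)^n \to \mathbb{R}^X$,

$$\bar{\mathcal{A}}(S)(x) = \mathbb{E}_{f \sim \mathcal{A}(S)} f(x).$$

For a distribution over functions $Q \in \mathcal{P}(\mathcal{F})$, define the doubly annealed loss,

$$\ell_\beta(P,Q) = -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P} \mathbb{E}_{f \sim Q}\, e^{-\beta(1 - yf(x))}\right).$$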




Preliminary version - December 17, 2015 


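Theorem 19 (PAC-Bayes Linear Loss theorem). For all distributions $P$, feature maps $\phi : X \to \mathcal{H}$, priors $\pi$, randomized algorithms $\mathcal{A}$ and $\beta > 0$,

$$\mathbb{E}_{S \sim P^n}\, \ell_\beta(P,\mathcal{A}(S)) \leq \mathbb{E}_{S \sim P^n} \left[ \ell(S,\bar{\mathcal{A}}(S)) + \frac{D_{KL}(\mathcal{A}(S),\pi)}{\beta n} \right].$$

Furthermore, with probability at least $1 - \delta$ on a draw from $S \sim P^n$ with $\mathcal{A}$, $\pi$ and $\beta$ fixed before the draw,

$$\ell_\beta(P,\mathcal{A}(S)) \leq \ell(S,\bar{\mathcal{A}}(S)) + \frac{D_{KL}(\mathcal{A}(S),\pi) + \log(\frac{1}{\delta})}{\beta n}.$$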


Proof. This is theorem 2.1 of [45] for linear loss, coupled with the convexity of $-\log$.

□ 


We call $\pi$ the prior and $\mathcal{A}(S)$ the posterior. The decision maker is lucky (has a tighter bound), if $D_{KL}(\mathcal{A}(S),\pi)$ is small. For linear function classes we identify $f_\omega \in \mathcal{F}_\phi$ with its weight vector $\omega$. We take $\mathcal{A}(S) \in \mathcal{P}(\mathcal{H})$ and with a slight abuse of notation define $\bar{\mathcal{A}}(S) = \mathbb{E}_{\omega \sim \mathcal{A}(S)}\, \omega$. We have,

$$\bar{\mathcal{A}}(S)(x) = \mathbb{E}_{\omega \sim \mathcal{A}(S)} \langle \omega, \phi(x) \rangle = \langle \bar{\mathcal{A}}(S), \phi(x) \rangle \in \mathcal{F}_\phi.$$


The sample risk of the posterior distribution is determined by its mean. To exploit this, we focus on posteriors and priors of simple form, allowing exact calculation of the annealed loss and the KL divergence term. We assume $\pi = \mathcal{N}(\omega_\pi, 1)$ and $\mathcal{A}(S) = \mathcal{N}(\bar{\mathcal{A}}(S), 1)$. In words, priors and posteriors are normal distributions with identity covariance. This restriction and the following theorem lead to theorem 2.

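Theorem 20. For all distributions $P$, feature maps $\phi$, prior vectors $\omega_\pi \in \mathcal{H}$, sample dependent weight vectors $\bar{\mathcal{A}} : (X \times Y)^n \to \mathcal{H}$ and $\beta > 0$ such that $\|\phi(x)\| \leq 1 \ \forall x$ and $\|\bar{\mathcal{A}}(S)\| \leq 1 \ \forall S$,

$$\mathbb{E}_{S \sim P^n}\, \ell(P,\bar{\mathcal{A}}(S)) \leq \mathbb{E}_{S \sim P^n} \left[ \ell(S,\bar{\mathcal{A}}(S)) + \frac{\|\bar{\mathcal{A}}(S) - \omega_\pi\|^2}{\beta n} \right] + \beta.$$

Furthermore, with probability at least $1 - \delta$ on a draw from $S \sim P^n$ with $\bar{\mathcal{A}}$, $\omega_\pi$ and $\beta$ fixed before the draw,

$$\ell(P,\bar{\mathcal{A}}(S)) \leq \ell(S,\bar{\mathcal{A}}(S)) + \frac{\|\bar{\mathcal{A}}(S) - \omega_\pi\|^2 + \log(\frac{1}{\delta})}{\beta n} + \beta.$$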


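Proof. We begin with theorem 19 and the function class $\mathcal{F}_\phi$. For priors and posteriors given by normal distributions with identity covariance,

$$D_{KL}(\mathcal{A}(S),\pi) = \frac{1}{2}\|\bar{\mathcal{A}}(S) - \omega_\pi\|^2 \leq \|\bar{\mathcal{A}}(S) - \omega_\pi\|^2.$$

For the left hand side of the bound,

$$\ell_\beta(P,\mathcal{A}(S)) = -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P} \mathbb{E}_{\omega \sim \mathcal{N}(\bar{\mathcal{A}}(S),1)}\, e^{-\beta(1 - \langle \omega, y\phi(x) \rangle)}\right)$$
$$= -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P}\, e^{-\beta(1 - \langle \bar{\mathcal{A}}(S), y\phi(x) \rangle) + \frac{\beta^2 \|\phi(x)\|^2}{2}}\right),$$

where the final line follows from standard results on the moment generating function of normal distributions. We can lower bound this quantity as follows,

$$\geq -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P}\, e^{-\beta(1 - \langle \bar{\mathcal{A}}(S), y\phi(x) \rangle)}\right) - \frac{\beta}{2}$$
$$\geq \mathbb{E}_{(x,y) \sim P}\, 1 - \langle \bar{\mathcal{A}}(S), y\phi(x) \rangle - \beta$$
$$= 1 - \langle \bar{\mathcal{A}}(S), \omega_P \rangle - \beta,$$

where the first line follows as $-\log$ is a decreasing function and $\|\phi(x)\| \leq 1$, and the second follows from lemma ?? of the appendix, which can be applied as by Cauchy-Schwarz,

$$|1 - \langle \bar{\mathcal{A}}(S), y\phi(x) \rangle| \in [0,2].$$

By theorem 19 we have,

$$\mathbb{E}_{S \sim P^n}\, 1 - \langle \bar{\mathcal{A}}(S), \omega_P \rangle - \beta \leq \mathbb{E}_{S \sim P^n} \left[ 1 - \langle \bar{\mathcal{A}}(S), \omega_S \rangle + \frac{\|\bar{\mathcal{A}}(S) - \omega_\pi\|^2}{\beta n} \right],$$

with a corresponding high probability version.

□

To recover theorem 3 consider the algorithm,

$$\mathcal{A}(S) = \mathcal{N}(\omega_S, 1),$$

with prior $\omega_\pi = 0$. Upper bounding $\|\bar{\mathcal{A}}(S) - \omega_\pi\|^2 \leq 1$ yields,

$$\ell(P,\omega_S) \leq \ell(S,\omega_S) + \frac{1 + \log(\frac{1}{\delta})}{\beta n} + \beta.$$

Finally, optimize over $\beta$.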

PAC-Bayesian Bounds for Learning over Multiple Feature Maps


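It is common for the decision maker to have access to several feature maps $\phi_i : X \to \mathcal{H}_i$, for $i$ in a (possibly infinite) index set $I$. Define,

$$\mathcal{F}_I = \sqcup_{i \in I}\, \mathcal{F}_{\phi_i},$$

the disjoint union of the function classes $\mathcal{F}_{\phi_i}$. Rather than priors and posteriors on a single $\mathcal{F}_{\phi_i}$, we consider distributions on $\mathcal{F}_I$ that are mixtures of normals,

$$\mathcal{A}(S) : \ i \sim \alpha(S), \ \omega^i \sim \mathcal{N}(\bar{\mathcal{A}}^i(S), 1)$$
$$\pi : \ i \sim \alpha_\pi, \ \omega^i \sim \mathcal{N}(\omega^i_\pi, 1),$$

where $\bar{\mathcal{A}}^i(S) \in \mathcal{H}_i$ and $\alpha_\pi, \alpha(S) \in \mathcal{P}(I)$. These distributions first pick a tag $i$ and then generate a weight vector $\omega^i \in \mathcal{H}_i$.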


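Theorem 21. For all distributions $P$, collections of feature maps $\phi_i$, prior weights $\alpha_\pi \in \mathcal{P}(I)$, prior vectors $\omega^i_\pi \in \mathcal{H}_i$, sample dependent weights $\alpha(S) \in \mathcal{P}(I)$, sample dependent weight vectors $\bar{\mathcal{A}}^i(S) \in \mathcal{H}_i$ and $\beta > 0$ such that $\|\phi_i(x)\| \leq 1 \ \forall x$ and $\|\bar{\mathcal{A}}^i(S)\| \leq 1 \ \forall S$,

$$\mathbb{E}_{S \sim P^n} \mathbb{E}_{i \sim \alpha(S)}\, \ell(P,\bar{\mathcal{A}}^i(S)) \leq \mathbb{E}_{S \sim P^n} \left[ \mathbb{E}_{i \sim \alpha(S)}\, \ell(S,\bar{\mathcal{A}}^i(S)) + \frac{D_{KL}(\alpha(S),\alpha_\pi) + \mathbb{E}_{i \sim \alpha(S)} \|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2}{\beta n} \right] + \beta.$$

Furthermore, with probability at least $1 - \delta$ on a draw from $S \sim P^n$ with $\bar{\mathcal{A}}^i$, $\omega^i_\pi$ and $\beta$ fixed before the draw,

$$\mathbb{E}_{i \sim \alpha(S)}\, \ell(P,\bar{\mathcal{A}}^i(S)) \leq \mathbb{E}_{i \sim \alpha(S)}\, \ell(S,\bar{\mathcal{A}}^i(S)) + \frac{D_{KL}(\alpha(S),\alpha_\pi) + \mathbb{E}_{i \sim \alpha(S)} \|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2 + \log(\frac{1}{\delta})}{\beta n} + \beta.$$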


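Proof. The proof proceeds in very similar fashion to that of the previous theorem. We begin with theorem 19 and the function class $\mathcal{F}_I$. By simple properties of the KL divergence [13], for priors and posteriors given by mixtures of normal distributions,

$$D_{KL}(\mathcal{A}(S),\pi) = D_{KL}(\alpha(S),\alpha_\pi) + \frac{1}{2}\mathbb{E}_{i \sim \alpha(S)} \|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2 \leq D_{KL}(\alpha(S),\alpha_\pi) + \mathbb{E}_{i \sim \alpha(S)} \|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2.$$

For the left hand side of the bound,

$$-\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P} \mathbb{E}_{\omega \sim \mathcal{A}(S)}\, e^{-\beta(1 - \langle \omega, y\phi(x) \rangle)}\right)$$
$$= -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P} \mathbb{E}_{i \sim \alpha(S)}\, e^{-\beta(1 - \langle \bar{\mathcal{A}}^i(S), y\phi_i(x) \rangle) + \frac{\beta^2 \|\phi_i(x)\|^2}{2}}\right),$$

where the final line follows from standard results on the moment generating function of normal distributions. We can lower bound this quantity as follows,

$$\geq -\frac{1}{\beta}\log\left(\mathbb{E}_{(x,y) \sim P} \mathbb{E}_{i \sim \alpha(S)}\, e^{-\beta(1 - \langle \bar{\mathcal{A}}^i(S), y\phi_i(x) \rangle)}\right) - \frac{\beta}{2}$$
$$\geq \mathbb{E}_{(x,y) \sim P} \mathbb{E}_{i \sim \alpha(S)}\, 1 - \langle \bar{\mathcal{A}}^i(S), y\phi_i(x) \rangle - \beta$$
$$= \mathbb{E}_{i \sim \alpha(S)}\, 1 - \langle \bar{\mathcal{A}}^i(S), \omega^i_P \rangle - \beta,$$

where the first line follows as $-\log$ is a decreasing function and $\|\phi_i(x)\| \leq 1$, and the second follows from lemma ?? of the appendix, which can be applied as by Cauchy-Schwarz,

$$|1 - \langle \bar{\mathcal{A}}^i(S), y\phi_i(x) \rangle| \in [0,2].$$

By theorem 19 we have,

$$\mathbb{E}_{S \sim P^n} \mathbb{E}_{i \sim \alpha(S)}\, 1 - \langle \bar{\mathcal{A}}^i(S), \omega^i_P \rangle - \beta \leq \mathbb{E}_{S \sim P^n} \left[ \mathbb{E}_{i \sim \alpha(S)}\, 1 - \langle \bar{\mathcal{A}}^i(S), \omega^i_S \rangle + \frac{D_{KL}(\alpha(S),\alpha_\pi) + \mathbb{E}_{i \sim \alpha(S)} \|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2}{\beta n} \right],$$

with a corresponding high probability version.

□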


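To recover theorem 5 consider the algorithm with,

$$\mathcal{A}^i(S) = \mathcal{N}(\omega^i_S, 1),$$

and $\alpha(S)$ placing all mass on the feature map with minimum $1 - \|\omega^i_S\|$. Using prior $\omega^i_\pi = 0$ and $\alpha_\pi$ the uniform distribution on $[1;k]$ and upper bounding,

$$\|\bar{\mathcal{A}}^i(S) - \omega^i_\pi\|^2 \leq 1 \ \text{ and } \ D_{KL}(\alpha(S),\alpha_\pi) \leq \log(k),$$

yields,

$$\ell(P,\omega^*_S) \leq \ell(S,\omega^*_S) + \frac{1 + \log(k) + \log(\frac{1}{\delta})}{\beta n} + \beta.$$

Finally, optimise over $\beta$.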

12 Proof of Theorem 6

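Before the proof we prove the following simple lemma.

Lemma. Let $v, \hat{v} \in \mathbb{R}$ with $|v - \hat{v}| \leq \epsilon$. Then $v \leq 0$ implies $\hat{v} \leq \epsilon$.

Proof. We have $\hat{v} - \epsilon \leq v \leq \hat{v} + \epsilon$. If $v \leq 0$, then $\hat{v} - \epsilon \leq 0$.

□


We now prove the theorem.

Proof. First we prove the forward implication. By the conditions of the theorem, $|f(x) - \hat{f}(x)| \leq \epsilon$ for all $x \in X$, meaning $|yf(x) - y\hat{f}(x)| \leq \epsilon$ for all pairs $(x,y)$. By the previous lemma, $y\hat{f}(x) \leq 0$ implies $yf(x) \leq \epsilon$. This means,

$$[\![y\hat{f}(x) \leq 0]\!] \leq [\![yf(x) \leq \epsilon]\!].$$

Averaging over $P$ yields the desired result. For the reverse implication, define the function,

$$\hat{f}(x) = \begin{cases} 0 & : |f(x)| \leq \epsilon \\ f(x) & : |f(x)| > \epsilon. \end{cases}$$

By simple calculation $\|f - \hat{f}\|_\infty \leq \epsilon$ and $\ell_{01}(P,\hat{f}) = \ell_\epsilon(P,f)$. By assumption, $\ell_{01}(P,\hat{f}) \leq \alpha$. Therefore $\ell_\epsilon(P,f) \leq \alpha$.

□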


13 Comparison with Makovoz's Theorem 

We call $\omega \in \mathrm{co}(S)$ $m$-sparse if it is a combination of only $m$ elements of $S$. Makovoz's theorem is an existential result concerning the degree to which one can approximate any $\omega \in \mathrm{co}(S)$ with an $m$-sparse approximation $\hat{\omega}_m$. Let $\{B(z_i,\epsilon)\}_{i=1}^{n}$ be a collection of $n$ balls in $\mathcal{H}$. We say such a collection of balls covers $S$ if $S \subseteq \cup_{i=1}^{n} B(z_i,\epsilon)$. We call $\epsilon$ the radius of the cover. Define the $m$-th entropy number of $S$ as,

$$\epsilon_m(S) := \inf\{\epsilon : \exists \text{ a cover of } S \text{ with radius } \epsilon \text{ and } n \leq m\}.$$

The entropy number of $S$ is a fine grained means to assess its complexity. Intuitively, the simpler $S$ is the faster $\epsilon_n(S)$ decays as $n \to \infty$. The following is theorem 27 in [23].

Theorem 22. Let $\mathcal{H}$ be a Hilbert space of dimension $d$. Then for all finite $S \subseteq \mathcal{H}$, for all $\omega \in \mathrm{co}(S)$, and for all even $m \leq |S|$ there exists an $m$-sparse $\hat{\omega} \in \mathrm{co}(S)$ such that,

$$\|\omega - \hat{\omega}\| \leq \frac{\sqrt{2}\,\epsilon_m(S)}{\sqrt{m}}.$$

Theorem 22 has an advantage over the analysis in section 6.1. It includes more information about the sample than just the diameter of $S$ and the distance from the sample mean to the boundary of $S$ in the form of the entropy numbers of $S$. It is known for $S$ the $d$-dimensional unit ball, $m^{-\frac{1}{d}} \leq \epsilon_m(S) \leq 4m^{-\frac{1}{d}}$ (see equation 1.1.10 of HI). Naively, this means theorem 22 gives rates of convergence,

$$\|\omega - \hat{\omega}\| \leq \frac{4\sqrt{2}}{m^{\frac{1}{2} + \frac{1}{d}}},$$

where $d$ can be replaced by $|S|$ for infinite dimensional problems. This suggests that herding outperforms the bound in theorem 22. Ideally one wants a version of equation 2 that has direct reference to the entropy numbers of $S$. This will be the subject of future work.


14 Proof of Lemma 12 and Theorem 13


Before the proofs we require the following lemma. 

Lemma 23. Let $\ell_1$ and $\ell_2$ be loss functions. $\ell_1$ and $\ell_2$ are order equivalent if and only if there exist constants $\alpha > 0$ and $\beta$ such that,

$$\ell_2(y,v) = \alpha \ell_1(y,v) + \beta.$$

This is theorem 2 of section 7.9 in [14]. We now prove lemma 12.

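Proof. We begin with the reverse implication. Since,

$$\mathbb{E}_{(v,y) \sim Q}\, \ell(y,v) = \mathbb{E}_{(v,y) \sim Q_\sigma}\, \ell_\sigma(y,v), \quad \forall Q,$$

we have $Q \leq_\ell Q' \Leftrightarrow Q_\sigma \leq_{\ell_\sigma} Q'_\sigma$. As we assume $\ell$ and $\ell_\sigma$ are order equivalent, $Q_\sigma \leq_{\ell_\sigma} Q'_\sigma \Leftrightarrow Q_\sigma \leq_\ell Q'_\sigma$. Therefore,

$$Q \leq_\ell Q' \Leftrightarrow Q_\sigma \leq_\ell Q'_\sigma.$$

For the forward implication, define the loss $\ell'$ with,

$$\begin{pmatrix} \ell'(-1,v) \\ \ell'(1,v) \end{pmatrix} = \begin{pmatrix} 1 - \sigma & \sigma \\ \sigma & 1 - \sigma \end{pmatrix} \begin{pmatrix} \ell(-1,v) \\ \ell(1,v) \end{pmatrix}, \quad \forall v \in \mathbb{R}.$$

It is easily verified that $\ell'_\sigma = \ell$. This means,

$$\mathbb{E}_{(v,y) \sim Q}\, \ell'(y,v) = \mathbb{E}_{(v,y) \sim Q_\sigma}\, \ell(y,v), \quad \forall Q,$$

but as $Q \leq_\ell Q' \Leftrightarrow Q_\sigma \leq_\ell Q'_\sigma$, we have,

$$Q \leq_\ell Q' \Leftrightarrow Q \leq_{\ell'} Q'.$$

Therefore $\ell$ and $\ell'$ are order equivalent. Invoking lemma 23 and the definition of $\ell'$ yields,

$$\begin{pmatrix} 1 - \sigma & \sigma \\ \sigma & 1 - \sigma \end{pmatrix} \begin{pmatrix} \ell(-1,v) \\ \ell(1,v) \end{pmatrix} = \alpha \begin{pmatrix} \ell(-1,v) \\ \ell(1,v) \end{pmatrix} + \beta, \quad \forall v \in \mathbb{R},$$

for $\alpha > 0$. This yields,

$$\begin{pmatrix} \ell(-1,v) \\ \ell(1,v) \end{pmatrix} = \frac{\alpha}{1 - 2\sigma} \begin{pmatrix} 1 - \sigma & -\sigma \\ -\sigma & 1 - \sigma \end{pmatrix} \begin{pmatrix} \ell(-1,v) \\ \ell(1,v) \end{pmatrix} + \beta, \quad \forall v \in \mathbb{R}.$$

Therefore $\ell$ is order equivalent to $\ell_\sigma$.

□

We now prove theorem 13.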


Proof. As $\ell$ and $\ell_\sigma$ are order equivalent, by lemma 23, $\ell_\sigma(y,v) = \alpha \ell(y,v) + \beta$. Combined with the definition of $\ell_\sigma$ this yields,

$$\frac{(1 - \sigma)\ell(y,v) - \sigma\ell(-y,v)}{1 - 2\sigma} = \alpha \ell(y,v) + \beta.$$

Setting $y = \pm 1$ yields the following two equations,

$$(1 - \sigma)\ell(1,v) - \sigma\ell(-1,v) = (1 - 2\sigma)(\alpha \ell(1,v) + \beta) \quad (4)$$
$$(1 - \sigma)\ell(-1,v) - \sigma\ell(1,v) = (1 - 2\sigma)(\alpha \ell(-1,v) + \beta). \quad (5)$$

Adding these two equations together and dividing through by $1 - 2\sigma$ yields,

$$\ell(1,v) + \ell(-1,v) = \alpha(\ell(1,v) + \ell(-1,v)) + 2\beta. \quad (6)$$

If $\alpha \neq 1$, $\ell(1,v) + \ell(-1,v) = \frac{2\beta}{1 - \alpha} = C$ and the proof is complete. If $\alpha = 1$, $\beta = 0$ by (6). Inserting these values into (4) yields,

$$(1 - \sigma)\ell(1,v) - \sigma\ell(-1,v) = (1 - 2\sigma)\ell(1,v).$$

Thus $\ell(1,v) = \ell(-1,v)$, an excluded pathological case. For the converse, if $\ell(y,v) + \ell(-y,v) = C$ then $\ell(-y,v) = C - \ell(y,v)$. This means,

$$\ell_\sigma(y,v) = \frac{(1 - \sigma)\ell(y,v) - \sigma\ell(-y,v)}{1 - 2\sigma}$$
$$= \frac{(1 - \sigma)\ell(y,v) - \sigma(C - \ell(y,v))}{1 - 2\sigma}$$
$$= \frac{1}{1 - 2\sigma}\ell(y,v) - \frac{\sigma C}{1 - 2\sigma},$$

and thus by the above lemma, $\ell$ and $\ell_\sigma$ are order equivalent.

□
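
The converse direction can be checked numerically for linear loss $\ell(y,v) = 1 - yv$, for which $\ell(1,v) + \ell(-1,v) = 2$. The sketch below compares the noise-corrected loss $\ell_\sigma(y,v) = \frac{(1-\sigma)\ell(y,v) - \sigma\ell(-y,v)}{1 - 2\sigma}$ against the affine form $\frac{1}{1-2\sigma}\ell(y,v) - \frac{\sigma C}{1-2\sigma}$, and shows that hinge loss does not satisfy the constant-sum condition. The noise rate $\sigma = 0.2$, the grid of scores and the helper names are arbitrary choices for illustration.

```python
import numpy as np

def linear_loss(y, v):
    return 1.0 - y * v

def hinge_loss(y, v):
    return np.maximum(0.0, 1.0 - y * v)

def noise_corrected(loss, sigma):
    """Return the noise-corrected loss ((1 - s) l(y, v) - s l(-y, v)) / (1 - 2 s)."""
    def corrected(y, v):
        return ((1 - sigma) * loss(y, v) - sigma * loss(-y, v)) / (1 - 2 * sigma)
    return corrected

sigma = 0.2
C = 2.0                                   # linear_loss(1, v) + linear_loss(-1, v) for every v
v = np.linspace(-3.0, 3.0, 13)
corrected = noise_corrected(linear_loss, sigma)

# Largest deviation between the corrected loss and its claimed affine form
max_gap = max(
    np.max(np.abs(corrected(y, v)
                  - (linear_loss(y, v) / (1 - 2 * sigma) - sigma * C / (1 - 2 * sigma))))
    for y in (-1.0, 1.0)
)

# Spread of l(1, v) + l(-1, v) over the grid: zero for linear loss, positive for hinge loss
linear_sum_spread = np.ptp(linear_loss(1.0, v) + linear_loss(-1.0, v))
hinge_sum_spread = np.ptp(hinge_loss(1.0, v) + hinge_loss(-1.0, v))
print(max_gap, linear_sum_spread, hinge_sum_spread)
```

An affine transformation with positive slope preserves the ordering of expected losses, which is why the corrected and original linear losses rank solutions identically.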


15 Proof of Theorem 14

Proof. We begin with the forward implication. We have $\ell(y,v)$ is convex in $v$, furthermore $\ell(y,v) + \ell(-y,v) = C$. This means $\ell(y,v) = C - \ell(-y,v)$, hence $-\ell(-y,v)$ is convex. Thus as $\ell(y,v)$ and $-\ell(y,v)$ are convex, $\ell(y,v) = \alpha_y v + g(y)$. But,

$$\ell(y,v) + \ell(-y,v) = \alpha_y v + g(y) + \alpha_{-y} v + g(-y)$$
$$= (\alpha_y + \alpha_{-y})v + g(y) + g(-y)$$
$$= C.$$

Therefore $\alpha_{-y} = -\alpha_y$, so $\alpha_y = \lambda y$ for some constant $\lambda$, and $\ell(y,v) = \lambda yv + g(y)$. For the converse, if $\ell(y,v) = \lambda yv + g(y)$, then,

$$\ell(y,v) + \ell(-y,v) = g(y) + g(-y) = C.$$

□


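16 Proof of Theorem 15


Proof. If $\sigma_1 \ell(-1,v) + \sigma_{-1} \ell(1,v) = C$, this means $\sigma_{-y}\ell(y,v) + \sigma_y \ell(-y,v) = C$ for all $y$. This yields,

$$\ell_{\sigma_{-1},\sigma_1}(y,v) = \frac{(1 - \sigma_{-y})\ell(y,v) - \sigma_y \ell(-y,v)}{1 - \sigma_{-1} - \sigma_1}$$
$$= \frac{(1 - \sigma_{-y})\ell(y,v) - (C - \sigma_{-y}\ell(y,v))}{1 - \sigma_{-1} - \sigma_1}$$
$$= \frac{1}{1 - \sigma_{-1} - \sigma_1}\ell(y,v) - \frac{C}{1 - \sigma_{-1} - \sigma_1},$$

where the first line is the definition of $\ell_{\sigma_{-1},\sigma_1}(y,v)$ and the second is by assumption. By lemma 23, $\ell_{\sigma_{-1},\sigma_1}$ and $\ell$ are order equivalent.

□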
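
17 Proof of Theorem 16

Proof. Recall the balanced error,

$$\mathrm{BER}_\ell(P_+,P_-,f) = \frac{1}{2}\mathbb{E}_{x \sim P_+} \ell(1,f(x)) + \frac{1}{2}\mathbb{E}_{x \sim P_-} \ell(-1,f(x)).$$

Remember that,

$$\tilde{P}_+ = (1 - \alpha)P_+ + \alpha P_- \ \text{ and } \ \tilde{P}_- = \beta P_+ + (1 - \beta)P_-.$$

This means for all classifiers $f$,

$$\mathbb{E}_{x \sim \tilde{P}_+} \ell(1,f(x)) = (1 - \alpha)\mathbb{E}_{x \sim P_+} \ell(1,f(x)) + \alpha \mathbb{E}_{x \sim P_-} \ell(1,f(x))$$
$$= (1 - \alpha)\mathbb{E}_{x \sim P_+} \ell(1,f(x)) - \alpha \mathbb{E}_{x \sim P_-} \ell(-1,f(x)) + C\alpha,$$

where in the second line we have used the fact that $\ell(1,v) = C - \ell(-1,v)$. Similarly,

$$\mathbb{E}_{x \sim \tilde{P}_-} \ell(-1,f(x)) = -\beta \mathbb{E}_{x \sim P_+} \ell(1,f(x)) + (1 - \beta)\mathbb{E}_{x \sim P_-} \ell(-1,f(x)) + C\beta.$$

Taking the average of these two equations yields,

$$\mathrm{BER}_\ell(\tilde{P}_+,\tilde{P}_-,f) = (1 - \alpha - \beta)\,\mathrm{BER}_\ell(P_+,P_-,f) + \frac{C(\alpha + \beta)}{2}.$$

□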
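
The identity relating the balanced error on the corrupted class-conditional distributions to the balanced error on the clean ones, namely corrupted BER $= (1 - \alpha - \beta) \cdot$ clean BER $+ \frac{C(\alpha+\beta)}{2}$, can be verified exactly on finite distributions. The sketch below does so for linear loss, where $C = 2$. The support size, the randomly drawn distributions, the scores and the contamination rates $\alpha = 0.2$, $\beta = 0.1$ are arbitrary values chosen for illustration.

```python
import numpy as np

def linear_loss(y, v):
    return 1.0 - y * v

def balanced_error(p_pos, p_neg, f):
    """Balanced error for distributions given as probability vectors over a finite support."""
    return 0.5 * p_pos @ linear_loss(1.0, f) + 0.5 * p_neg @ linear_loss(-1.0, f)

rng = np.random.default_rng(0)
k = 6                                      # number of support points
p_pos = rng.dirichlet(np.ones(k))          # clean positive class-conditional
p_neg = rng.dirichlet(np.ones(k))          # clean negative class-conditional
f = rng.uniform(-1.0, 1.0, size=k)         # classifier scores on the support

alpha, beta, C = 0.2, 0.1, 2.0
p_pos_noisy = (1 - alpha) * p_pos + alpha * p_neg
p_neg_noisy = beta * p_pos + (1 - beta) * p_neg

lhs = balanced_error(p_pos_noisy, p_neg_noisy, f)
rhs = (1 - alpha - beta) * balanced_error(p_pos, p_neg, f) + C * (alpha + beta) / 2
print(lhs, rhs)
```

Since the corrupted balanced error is a positive affine function of the clean one whenever $\alpha + \beta < 1$, minimizing either over $f$ selects the same classifier.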


18 Proof of Theorem 17

Proof. Firstly, for all classifiers $f$,

$$\ell(P_\sigma,f) = \mathbb{E}_{(x,y) \sim P}\, (1 - \sigma(x,y))\ell(y,f(x)) + \sigma(x,y)\ell(-y,f(x))$$
$$= \mathbb{E}_{(x,y) \sim P}\, (1 - \sigma(x,y))\ell(y,f(x)) + \sigma(x,y)(C - \ell(y,f(x)))$$
$$= \mathbb{E}_{(x,y) \sim P}\, (1 - 2\sigma(x,y))\ell(y,f(x)) + C\,\mathbb{E}_{(x,y) \sim P}\, \sigma(x,y),$$

where in the second line we have used the fact that $\ell(1,v) + \ell(-1,v) = C$. Now let,

$$f_\sigma = \arg\min_{f \in \mathcal{F}} \ell(P_\sigma,f) \ \text{ and } \ f^* = \arg\min_{f \in \mathcal{F}} \ell(P,f),$$

respectively. By definition, $\ell(P_\sigma,f_\sigma) \leq \ell(P_\sigma,f^*)$. Combined with the above this yields,

$$\mathbb{E}_{(x,y) \sim P}\, (1 - 2\sigma(x,y))\ell(y,f_\sigma(x)) \leq \mathbb{E}_{(x,y) \sim P}\, (1 - 2\sigma(x,y))\ell(y,f^*(x)).$$

From the assumption that $\sigma(x,y) < \frac{1}{2}$ for all $(x,y) \in X \times Y$,

$$\min_{(x,y)} 1 - 2\sigma(x,y) \leq 1 - 2\sigma(x,y) \leq 1, \quad \forall (x,y) \in X \times Y.$$

This yields,

$$\left(\min_{(x,y)} 1 - 2\sigma(x,y)\right) \mathbb{E}_{(x,y) \sim P}\, \ell(y,f_\sigma(x)) \leq \mathbb{E}_{(x,y) \sim P}\, \ell(y,f^*(x)),$$

and the proof is complete.

□